Abstract

We use seven machine learning algorithms for 
one task: identifying base noun phrases. The 
results have been processed by different system 
combination methods and all of these outper- 
formed the best individual result. We have ap- 
plied the seven learners with the best combina- 
tor, a majority vote of the top five systems, to a 
standard data set and managed to improve the 
best published result for this data set. 

1 Introduction 

Van Halteren et al. (1998) and Brill and Wu (1998) show that
part-of-speech tagger performance can be improved by combining
different taggers. By using techniques such as majority voting,
errors made by the minority of the taggers can be removed.
Van Halteren et al. (1998) report that such a combined approach
can reduce the error rate of the best individual system by as
much as 19%.
The positive effect of system combination for 
non-language processing tasks has been shown 
in a large body of machine learning work. 

In this paper we will use system combination 
for identifying base noun phrases (baseNPs). 
We will apply seven machine learning algo- 
rithms to the same baseNP task. At two points 
we will apply combination methods. We will 
start with making the systems process five out- 
put representations and combine the results by 
choosing the majority of the output features. 
Three of the seven systems use this approach. 
After this we will make an overall combination 
of the results of the seven systems. There we 
will evaluate several system combination methods. The best
performing method will be applied to a standard data set for
baseNP identification.

2 Methods and experiments 

In this section we will describe our learning task: 
recognizing base noun phrases. After this we 
will describe the data representations we used 
and the machine learning algorithms that we 
will apply to the task. We will conclude with 
an overview of the combination methods that 
we will test. 

2.1 Task description 

Base noun phrases (baseNPs) are noun phrases 
which do not contain another noun phrase. For 
example, the sentence 

In [ early trading ] in [ Hong Kong ] 
[ Monday ] , [ gold ] was quoted at 
[ $ 366.50 ] [ an ounce ] .



contains six baseNPs (marked as phrases be- 
tween square brackets). The phrase $ 366.50 
an ounce is a noun phrase as well. However, 
it is not a baseNP since it contains two other 
noun phrases. Two baseNP data sets have been put forward by
Ramshaw and Marcus (1995). The main data set consists of four
sections of the Wall Street Journal (WSJ) part of the Penn
Treebank (Marcus et al., 1993) as training material (sections
15-18, 211727 tokens) and one section as test material (section
20, 47377 tokens).^ The data contains words, their part-of-speech
(POS) tags as computed by the Brill tagger and their baseNP
segmentation as derived from the Treebank (with some
modifications).

^ This Ramshaw and Marcus (1995) baseNP data set is available via
ftp://ftp.cis.upenn.edu/pub/chunker/

In the baseNP identification task, performance is measured with
three rates. The first is the percentage of detected noun phrases
that are correct (precision). The second is the percentage of
noun phrases in the data that were found by the classifier
(recall). The third is the Fβ=1 rate, which is equal to
(2*precision*recall)/(precision+recall). The latter rate
has been used as the target for optimization.
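
The three rates can be illustrated with a minimal sketch. It assumes that phrases are compared as exact spans of inclusive word indices; the function name `evaluate` and the toy spans are hypothetical and are not taken from the evaluation software used in the experiments.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, F) for two collections of phrase spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # a phrase counts only if both ends match
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Hypothetical spans, given as (first word index, last word index).
gold = {(1, 2), (4, 5), (6, 6), (8, 8)}
predicted = {(1, 2), (4, 6), (8, 8)}
precision, recall, f = evaluate(predicted, gold)
```

In this toy case two of the three predicted phrases are correct and two of the four gold phrases are found, so a phrase with one misplaced bracket lowers both precision and recall.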

2.2 Data representation 



In our example sentence in section 2.1, noun phrases are
represented by bracket structures. It has been shown by Muñoz et
al. (1999) that for baseNP recognition, the representation with
brackets outperforms other data representations. One classifier
can be trained to recognize open brackets (O) and another can
handle close brackets (C). Their results can be combined by
making pairs of open and close brackets with large probability
scores. We have used this bracket representation (O+C) as well.
However, we have not used the combination strategy from Muñoz et
al. (1999) but instead used the strategy outlined in Tjong Kim
Sang (2000): regard only the shortest possible phrases between
candidate open and close brackets as base noun phrases.
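
One possible reading of the shortest-phrase strategy is sketched below. It assumes that bracket candidates are given as two boolean lists per sentence (an open bracket before word i, a close bracket after word i) and that an open bracket is dropped when another open bracket occurs before the next close bracket. The function name and the spurious open bracket in the example are hypothetical.

```python
def shortest_phrases(opens, closes):
    """Pair each open bracket with the nearest close bracket, keeping only
    pairs that have no other open bracket between them."""
    phrases = []
    n = len(opens)
    for i in range(n):
        if not opens[i]:
            continue
        for j in range(i, n):
            if j > i and opens[j]:
                break  # a later open bracket would give a shorter phrase
            if closes[j]:
                phrases.append((i, j))
                break
    return phrases


words = ("In early trading in Hong Kong Monday , gold was quoted at "
         "$ 366.50 an ounce .").split()
open_at = {0, 1, 4, 6, 8, 12, 14}   # index 0 is a made-up spurious candidate
close_at = {2, 5, 6, 8, 13, 15}
opens = [i in open_at for i in range(len(words))]
closes = [i in close_at for i in range(len(words))]
phrases = shortest_phrases(opens, closes)
```

With these candidates the spurious open bracket before "In" is discarded, because the open bracket before "early" yields a shorter phrase ending at the same close bracket.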

An alternative representation for baseNPs 
has been put forward by Ramshaw and Marcus (1995). They have
defined baseNP recognition as a tagging task: words can be
inside a baseNP (I) or outside a baseNP (O). In the case that one
baseNP immediately follows another baseNP, the first word in the
second baseNP receives tag B. Example:

In_O early_I trading_I in_O Hong_I Kong_I
Monday_B ,_O gold_I was_O quoted_O at_O
$_I 366.50_I an_B ounce_I ._O

This set of three tags is sufficient for encod- 
ing baseNP structures since these structures are 
nonrecursive and nonoverlapping.

Tjong Kim Sang (2000) outlines alternative versions of this
tagging representation. First, the B tag can be used for the
first word of every baseNP (IOB2 representation). Second, instead
of the B tag an E tag can be used to mark the last word of a
baseNP immediately before another baseNP (IOE1). And third, the
E tag can be used for every noun phrase final word (IOE2). He
used the Ramshaw and Marcus (1995) representation as well (IOB1).
We will use these four tagging representations and the O+C
representation for the system-internal combination experiments.
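
The four tagging representations can be derived from the same set of phrases. The sketch below is an illustration only: the function name `encode` and the span format (inclusive word indices) are assumptions, and the example phrases are those of the sentence in section 2.1.

```python
def encode(n_words, phrases, scheme):
    """Encode baseNP spans as IOB1, IOB2, IOE1 or IOE2 tags."""
    tags = ["O"] * n_words
    starts = {start for start, _ in phrases}
    ends = {end for _, end in phrases}
    for start, end in phrases:
        for i in range(start, end + 1):
            tags[i] = "I"
        # B marks a phrase start: always (IOB2) or only after another phrase (IOB1).
        if scheme == "IOB2" or (scheme == "IOB1" and start - 1 in ends):
            tags[start] = "B"
        # E marks a phrase end: always (IOE2) or only before another phrase (IOE1).
        if scheme == "IOE2" or (scheme == "IOE1" and end + 1 in starts):
            tags[end] = "E"
    return tags


phrases = [(1, 2), (4, 5), (6, 6), (8, 8), (12, 13), (14, 15)]
iob1 = " ".join(encode(17, phrases, "IOB1"))
iob2 = " ".join(encode(17, phrases, "IOB2"))
ioe1 = " ".join(encode(17, phrases, "IOE1"))
ioe2 = " ".join(encode(17, phrases, "IOE2"))
```

The IOB1 output reproduces the tagged example above; the other three strings differ only in where the B and E tags appear.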

2.3 Machine learning algorithms 

This section contains a brief description of the 
seven machine learning algorithms that we will 
apply to the baseNP identification task: ALLiS, C5.0, IGTree,
MaxEnt, MBL, MBSL and
SNoW. 

ALLiS (Architecture for Learning Linguistic Structures) is a
learning system which uses theory refinement in order to learn
non-recursive NP and VP structures (Déjean, 2000). ALLiS
generates a regular expression grammar which describes the phrase
structure (NP or VP). This grammar is then used by the CASS
parser (Abney, 1996). Following the principle of theory
refinement, the learning task is composed of two steps. The first
step is the generation of an initial grammar. The generation of
this grammar uses the notion of default values and some
background knowledge which provides general expectations
concerning the inner structure of NPs and VPs. This initial
grammar provides an incomplete and/or incorrect analysis of the
data. The second step is the refinement of this grammar. During
this step, the validity of the rules of the initial grammar is
checked and the rules are improved (refined) if necessary. This
refinement relies on the use of two operations: contextualization
(determining in which contexts a tag always belongs to the
phrase) and lexicalization (use of information about the words
and not only about POS).

C5.0, a commercial version of C4.5 (Quinlan, 1993), performs
top-down induction of decision trees (TDIDT). On the basis of an
instance base of examples, C5.0 constructs a decision tree which
compresses the classification information in the instance base by
exploiting differences in relative importance of different
features. Instances are stored in the tree as paths of connected
nodes ending in leaves which contain classification information.
Nodes are connected via arcs denoting feature values. Feature
information gain (mutual information between features and class)
is used to determine the order in which features are employed as
tests at all levels of the tree (Quinlan, 1993). With the full
input representation (words and POS tags), we were not able to
run complete experiments. We therefore experimented only with the
POS tags (with a context of two left and right). We have used the
default parameter setting with decision trees combined with value
grouping.
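
Information gain can be computed as the class entropy minus the expected class entropy after splitting on a feature. The sketch below uses this textbook definition; it is not the C5.0 implementation, and the function names and toy instances are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(instances, labels, feature):
    """Reduction in class entropy obtained by splitting on one feature."""
    by_value = defaultdict(list)
    for instance, label in zip(instances, labels):
        by_value[instance[feature]].append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in by_value.values())
    return entropy(labels) - remainder


# Toy data: feature 0 predicts the class perfectly, feature 1 not at all.
instances = [("DT", "a"), ("DT", "b"), ("VB", "a"), ("VB", "b")]
labels = ["I", "I", "O", "O"]
gains = [information_gain(instances, labels, f) for f in range(2)]
order = sorted(range(2), key=lambda f: -gains[f])
```

A tree builder that ranks features this way would test feature 0 first, since it removes all uncertainty about the class in the toy data.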

We have used a nearest neighbor algorithm (IB1-IG, here listed as
MBL) and a decision tree algorithm (IGTree) from the TiMBL
learning package (Daelemans et al., 1999b). Both algorithms store
the training data and classify new items by choosing the most
frequent classification among training items which are closest to
this new item. Data items are represented as sets of
feature-value pairs. Each feature receives a weight which is
based on the amount of information which it provides for
computing the classification of the items in the training data.
IB1-IG uses these weights for computing the distance between a
pair of data items and IGTree uses them for deciding which
feature-value decisions should be made in the top nodes of the
decision tree (Daelemans et al., 1999b). We will use their
default parameters except for the IB1-IG parameter for the number
of examined nearest neighbors (k) which we have set to 3
(Daelemans et al., 1999a). The classifiers use a left and right
context of four words and part-of-speech tags. For the four IO
representations we have used a second processing stage which used
a smaller context but which included information about the IO
tags predicted by the first processing phase (Tjong Kim Sang,
2000).
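
The weighted-distance idea behind the memory-based learner can be sketched as follows. This is a simplification and not the TiMBL code: it assumes a weighted overlap distance (the sum of the weights of mismatching features), takes the k nearest stored items, and uses made-up weights and a made-up instance memory.

```python
from collections import Counter


def weighted_overlap(a, b, weights):
    """Sum of the weights of the features on which two items differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def classify_knn(item, memory, weights, k=3):
    """Return the most frequent class among the k nearest stored items."""
    ranked = sorted(memory,
                    key=lambda ex: weighted_overlap(item, ex[0], weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical memory: (previous POS tag, current POS tag) -> I/O tag.
memory = [
    (("DT", "NN"), "I"), (("JJ", "NN"), "I"), (("DT", "JJ"), "I"),
    (("NN", "VBD"), "O"), (("NN", "IN"), "O"), (("NNS", "VBD"), "O"),
]
weights = [1.0, 2.0]  # assumed: the current tag is more informative
tag_a = classify_knn(("PRP$", "NN"), memory, weights)
tag_b = classify_knn(("NNS", "IN"), memory, weights)
```

Because the second feature carries more weight, an unseen item is pulled toward stored items that share its current POS tag.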

When building a classifier, one must gather evidence for
predicting the correct class of an item from its context. The
Maximum Entropy (MaxEnt) framework is especially suited for
integrating evidence from various information sources.
Frequencies of evidence/class combinations (called features) are
extracted from a sample corpus and considered to be properties of
the classification process. Attention is constrained to models
with these properties. The MaxEnt principle now demands that
among all the probability distributions that obey these
constraints, the most uniform is chosen. During training,
features are assigned weights in such a way that, given the
MaxEnt principle, the training data is matched as well as
possible. During evaluation it is tested which features are
active (i.e. a feature is active when the context meets the
requirements given by the feature). For every class the weights
of the active features are combined and the best scoring class is
chosen (Berger et al., 1996). For the classifier built here the
surrounding words, their POS tags and baseNP tags predicted for
the previous words are used as evidence. A mixture of simple
features (consisting of one of the mentioned information sources)
and complex features (combinations thereof) was used. The left
context never exceeded 3 words, the right context was maximally 2
words. The model was calculated using existing software (Dehaspe,
1997).
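
The evaluation step can be sketched with the usual log-linear form of a MaxEnt model, in which the weights of the active features are summed per class, exponentiated and normalized. The feature names and weight values below are invented for illustration and do not come from the trained model or from the software used here.

```python
import math


def maxent_probabilities(active_features, weights, classes):
    """p(class | context), proportional to exp(sum of active feature weights)."""
    scores = {
        c: math.exp(sum(weights.get((f, c), 0.0) for f in active_features))
        for c in classes
    }
    z = sum(scores.values())  # normalization constant
    return {c: s / z for c, s in scores.items()}


# Hypothetical weights for (feature, class) pairs.
weights = {
    ("pos=NN", "I"): 1.2,
    ("prev_tag=I", "I"): 0.5,
    ("pos=NN", "O"): -0.4,
    ("word=gold", "I"): 0.3,
}
active = ["pos=NN", "prev_tag=I"]
probs = maxent_probabilities(active, weights, ["I", "O"])
best = max(probs, key=probs.get)
```

Features that are not active in the current context, such as `word=gold` here, contribute nothing to either class score.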

MBSL (Argamon et al., 1999) uses POS data
in order to identify baseNPs. Inference re- 
lies on a memory which contains all the oc- 
currences of POS sequences which appear in 
the beginning, or the end, of a baseNP (in- 
cluding complete phrases). These sequences 
may include a few context tags, up to a pre- 
specified max-context. During inference, MBSL 
tries to 'tile' each POS string with parts of 
noun-phrases from the memory. If the string 
could be fully covered by the tiles, it becomes part of a
candidate list; ambiguities between candidates are resolved by a
constraint propagation algorithm. Adding a context extends the
possibilities for tiling, thereby giving more op- 
portunities to better candidates. The approach 
of MBSL to the problem of identifying baseNPs 
is sequence-based rather than word-based, that 
is, decisions are taken per POS sequence, or per 
candidate, but not for a single word. In addi- 
tion, the tiling process gives no preference to 
any direction in the sentence. The tiles may be 
of any length, up to the maximal length of a 
phrase in the training data, which gives MBSL 
a generalization power that compensates for the 
setup of using only POS tags. The results pre- 
sented here were obtained by optimizing MBSL 
parameters based on 5-fold CV on the training 
data. 

SNoW uses the Open/Close model, described in Muñoz et al. (1999).
As is shown there, this model produced better results than the
other paradigm evaluated there, the Inside/Outside paradigm. The
Open/Close model consists of two SNoW predictors, one of which
predicts the beginning of baseNPs (Open predictor), and the other
predicts the end of the phrase (Close predictor). The Open
predictor is learned using SNoW (Carlson et al., 1999; Roth,
1998) as a function of features that utilize words and POS tags
in the sentence and, given a new sentence, will predict for each
word whether it is the first word in the phrase or not. For each
Open, the Close predictor is learned using SNoW as a function of
features that utilize the words in the sentence, the POS tags and
the open prediction. It will predict, for each word, whether it
can be the end of the phrase, given the previously predicted
Open. Each pair of predicted Open and Close forms a candidate of
a baseNP. These candidates may conflict due to overlapping; at
this stage, a graph-based constraint satisfaction algorithm that
uses the confidence values SNoW associates with its predictions
is employed. This algorithm ("the combinator") produces the list
of the final baseNPs for each sentence. Details of SNoW, its
application in shallow parsing and the combinator's algorithm are
in Muñoz et al. (1999).

            O        C       Fβ=1     O        C       Fβ=1     O        C       Fβ=1
O+C       97.72%   98.04%   92.03   97.82%   98.15%   92.26   96.89%   97.49%   89.37
Majority  98.04%   98.20%   92.82   97.94%   98.24%   92.60   97.70%   97.99%   91.92

Table 1: The effects of system-internal combination by using
different output representations. A straightforward majority vote
of the output yields better bracket accuracies and Fβ=1 rates
than any included individual classifier. The bracket accuracies
in the columns O and C show what percentage of words was
correctly classified as baseNP start, baseNP end or neither.

2.4 Combination techniques

At two points in our noun phrase recognition process we will use
system combination. We will start with system-internal
combination: apply the same learning algorithm to variants of the
task and combine the results. The approach we have chosen here is
the same as in Tjong Kim Sang (2000): generate different variants
of the task by using different representations of the output
(IOB1, IOB2, IOE1, IOE2 and O+C). The five outputs will be
converted to the open bracket representation (O) and the close
bracket representation (C) and after this, the most frequent of
the five analyses of each word will be chosen (majority voting,
see below). We expect the systems which use this combination
phase to perform better than their individual members (Tjong Kim
Sang, 2000).
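
A minimal sketch of the per-word majority vote is given below. It assumes that each of the five outputs has already been converted to one stream of open-bracket decisions per word ("[" for an open bracket, "-" for none); the symbols, the function name and the toy streams are illustrative only. The close-bracket stream would be handled in the same way.

```python
from collections import Counter


def majority_vote(analyses):
    """Choose, for every word, the most frequent tag among the analyses."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*analyses)]


# Hypothetical open-bracket streams derived from five output representations.
streams = [
    ["[", "-", "-", "["],
    ["[", "-", "[", "["],
    ["[", "-", "-", "-"],
    ["-", "-", "-", "["],
    ["[", "-", "-", "["],
]
combined = majority_vote(streams)
```

With five voters and two possible decisions per word no ties can occur, so the result does not depend on the order of the streams.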



Our seven learners will generate different classifications of the
training data and we need to find out which combination
techniques are most appropriate. For the system-external
combination experiment, we have evaluated different voting
mechanisms, effectively the voting methods as described in Van
Halteren et al. (1998). In the first method each classification
receives the same weight and the most frequent classification is
chosen (Majority). The second method regards as the weight of
each individual classification algorithm its accuracy on some
part of the data, the tuning data (TotPrecision). The third
voting method computes the precision of each assigned tag per
classifier and uses this value as a weight for the classifier in
those cases that it chooses the tag (TagPrecision). The fourth
method uses both the precision of each assigned tag and the
recall of the competing tags (Precision-Recall). Finally, the
fifth method uses not only a weight for the current
classification but it also computes weights for other possible
classifications. The other classifications are determined by
examining the tuning data and registering the correct values for
every pair of classifier results (pair-wise voting, see Van
Halteren et al. (1998) for an elaborate explanation).
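
The second and third voting methods can be sketched as follows. This is an illustration under simplifying assumptions and not the implementation used in the experiments: weights are estimated from a tiny invented tuning set, and the classifier names A, B and C are placeholders.

```python
from collections import defaultdict


def weighted_vote(predictions, weight):
    """predictions: {classifier: tag}; weight(classifier, tag) gives the vote weight."""
    totals = defaultdict(float)
    for clf, tag in predictions.items():
        totals[tag] += weight(clf, tag)
    return max(totals, key=totals.get)


def tot_precision(tuning_output, tuning_gold):
    """TotPrecision weights: overall tuning-data accuracy per classifier."""
    return {clf: sum(p == g for p, g in zip(out, tuning_gold)) / len(tuning_gold)
            for clf, out in tuning_output.items()}


def tag_precision(tuning_output, tuning_gold):
    """TagPrecision weights: precision of every tag assigned by a classifier."""
    table = {}
    for clf, out in tuning_output.items():
        assigned, correct = defaultdict(int), defaultdict(int)
        for p, g in zip(out, tuning_gold):
            assigned[p] += 1
            correct[p] += p == g
        for tag in assigned:
            table[(clf, tag)] = correct[tag] / assigned[tag]
    return table


tuning_gold = ["[", "-", "-", "[", "-", "-"]
tuning_output = {
    "A": ["[", "-", "-", "[", "-", "-"],
    "B": ["[", "[", "-", "-", "-", "-"],
    "C": ["-", "[", "-", "-", "-", "-"],
}
tot = tot_precision(tuning_output, tuning_gold)
tagp = tag_precision(tuning_output, tuning_gold)

votes = {"A": "-", "B": "[", "C": "["}
by_tot = weighted_vote(votes, lambda clf, tag: tot[clf])
by_tag = weighted_vote(votes, lambda clf, tag: tagp.get((clf, tag), 0.0))
```

In this toy case TotPrecision still follows the two weaker classifiers, while TagPrecision sides with classifier A, because B and C were unreliable on the tuning data whenever they proposed an open bracket.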

Apart from these five voting methods we have 
also processed the output streams with two clas- 
sifiers: MBL and IGTree. This approach is 
called classifier stacking. Like Van Halteren et al. (1998), we
have used different input versions: one containing only the
classifier output
and another containing both classifier output 
and a compressed representation of the data 
item under consideration. For the latter pur- 
pose we have used the part-of-speech tag of the 
current word. 
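
The two input versions for stacking can be illustrated with a small sketch. The second-level learner below is a deliberately simple stand-in (an exact-match memory with a majority-vote fallback) rather than MBL or IGTree, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict


def make_instance(outputs, pos_tag=None):
    """Second-level features: first-level outputs, optionally plus the POS tag."""
    return tuple(outputs) + ((pos_tag,) if pos_tag is not None else ())


def train_stacked(instances, gold):
    """Store how often each output pattern co-occurred with each correct tag."""
    memory = defaultdict(Counter)
    for instance, label in zip(instances, gold):
        memory[instance][label] += 1
    return memory


def classify_stacked(memory, instance, n_outputs):
    if instance in memory:
        return memory[instance].most_common(1)[0][0]
    # Unseen pattern: fall back to a majority vote over the first-level outputs.
    return Counter(instance[:n_outputs]).most_common(1)[0][0]


# Tuning data: outputs of three first-level classifiers, POS tag, correct tag.
tuning = [
    (("[", "-", "-"), "DT", "["),
    (("[", "-", "-"), "DT", "["),
    (("-", "[", "["), "NN", "-"),
    (("-", "-", "-"), "NN", "-"),
]
memory = train_stacked([make_instance(o, pos) for o, pos, _ in tuning],
                       [g for _, _, g in tuning])
seen = classify_stacked(memory, make_instance(("[", "-", "-"), "DT"), 3)
unseen = classify_stacked(memory, make_instance(("-", "-", "["), "VB"), 3)
```

Unlike voting, the stacked classifier can learn that a particular pattern of disagreement is usually resolved in favor of the minority, as in the first test item.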

3 Results^ 

We want to find out whether system combination could improve
performance of baseNP recognition and, if so, we want to select
the best combination technique. For this purpose we have
performed an experiment with sections 15-18 of the WSJ part of
the Penn Treebank as training data (211727 tokens) and section 21
as test data (40039 tokens). Like the data used by Ramshaw and
Marcus (1995), this data was retagged by the Brill tagger in
order to obtain realistic part-of-speech (POS) tags.^ The data
was segmented into baseNP parts and non-baseNP parts in a similar
fashion as the data used by Ramshaw and Marcus (1995). Of the
training data, only 90% was used for training. The remaining 10%
was used as tuning data for determining the weights of the
combination techniques.

For three classifiers (MBL, MaxEnt and IGTree) we have used
system-internal combination. These learning algorithms have
processed five different representations of the output (IOB1,
IOB2, IOE1, IOE2 and O+C) and the results have been combined with
majority voting. The test data results can be found in Table 1.
In all cases, the combined results were better than those of the
best included system.

The results of ALLiS, C5.0, MBSL and SNoW have been converted to
the O and the C representation. Together with the bracket
representations of the other three techniques, this gave us a
total of seven O results and seven C results. These two data
streams have been combined with the combination techniques
described in section 2.4. After this, we built baseNPs from the O
and C results of each combination technique, as described in
section 2.2. The bracket accuracies and the Fβ=1 scores for test
data can be found in Table 2.

All combinations improve the results of the best individual
classifier. The best results were obtained with a memory-based
stacked classifier. This is different from the combination
results presented in Van Halteren et al. (1998), in which
pairwise voting performed best. However, in their later work
stacked classifiers outperform voting methods as well (Van
Halteren et al., to appear).

                  O        C       Fβ=1
Decision Trees
  Tags          98.24%   98.35%   93.39
  Tags + POS    98.13%   98.32%   93.21

Table 2: Bracket accuracies and Fβ=1 scores for section WSJ 21 of
the Penn Treebank with seven individual classifiers and
combinations of them. Each combination performs better than its
best individual member. The stacked classifiers without context
information perform best.

^ Detailed results of our experiments are available on
http://lcg-www.uia.ac.be/~erikt/npcombi/

^ The retagging was necessary to assure that the performance
rates obtained here would be similar to rates obtained for texts
for which no Treebank POS tags are available.
Table 3: The overall performance of the majority voting combination of our best five systems
(selected on tuning data performance) applied to the standard data set put forward by Ramshaw
and Marcus (1995) together with an overview of earlier work. The accuracy scores indicate how
often a word was classified correctly with the representation used (O, C or IOB1). The combined
system outperforms all earlier reported results for this data set.



Based on an earlier combination study (Tjong Kim Sang, 2000) we
had expected the voting methods to do better. We suspect that
their performance is below that of the stacked classifiers
because the difference between the best and the worst individual
system is larger than in our earlier study. We assume that the
voting methods might perform better if they were only applied to
the classifiers that perform well on this task. In order to test
this hypothesis, we have repeated the combination experiments
with the best n classifiers, where n took values from 3 to 6 and
the classifiers were ranked based on their performance on the
tuning data. The best performances were obtained with five
classifiers: Fβ=1=93.44 for all five voting methods with the best
stacked classifier reaching 93.24. With the top five classifiers,
the voting methods outperform the best combination with seven
systems.^ Adding extra classification results to a good
combination system should not make overall performance worse so
it is clear that there is some room left for improvement of our
combination algorithms.

We conclude that the best results in this task can be obtained
with the simplest voting method, majority voting, applied to the
best five of our classifiers. Our next task was to apply the
combination approach to a standard data set so that we could
compare our results with other work. For this purpose we have
used the data put forward by Ramshaw and Marcus (1995). Again,
only 90% of the training data was used for training while the
remaining 10% was reserved for ranking the classifiers. The seven
learners were trained with the same parameters as in the previous
experiment. Three of the classifiers (MBL, MaxEnt and IGTree)
used system-internal combination by processing different output
representations.

The classifier output was converted to the O and the C
representation. Based on the tuning data performance, the
classifiers ALLiS, IGTree, MaxEnt, MBL and SNoW were selected for
being combined with majority voting. After this, the resulting O
and C representations were combined to baseNPs by using the
method described in section 2.2. The results can be found in
Table 3. Our combined system obtains an Fβ=1 score of 93.86 which
corresponds to an 8% error reduction compared with the best
published result for this data set (93.26).

^ We are unaware of a good method for determining the
significance of Fβ=1 differences but we assume that this Fβ=1
difference is not significant. However, we believe that the fact
that more combination methods perform well shows that it is
easier to get a good performance out of the best five systems
than with all seven.



4 Concluding remarks 

In this paper we have examined two methods for 
combining the results of machine learning algo- 
rithms for identifying base noun phrases. In the 
first method, the learner processed different out- 
put data representations and the results were 
combined by majority voting. This approach 
yielded better results than the best included 
classifier. In the second combination approach 
we have combined the results of seven learning 
systems (ALLiS, C5.0, IGTree, MaxEnt, MBL, 
MBSL and SNoW). Here we have tested different combination
methods. Each combination method outperformed the best individual
learning algorithm and a majority vote of the top
five systems performed best. We have applied 
this approach of system-internal and system- 
external combination to a standard data set for 
base noun phrase identification and the perfor- 
mance of our system was better than any other 
published result for this data set. 

Our study shows that the combination methods that we have tested
are sensitive to the inclusion of classifier results of poor
quality. This
leaves room for improvement of our results by 
evaluating other combinators. Another interest- 
ing approach which might lead to a better per- 
formance is taking into account more context 
information, for example by combining com- 
plete phrases instead of independent brackets. 
It would also be worthwhile to evaluate using 
more elaborate methods for building baseNPs 
out of open and close bracket candidates. 